Introduction

...

On CrowdFlower, each revision is rated 10 times. Each rater answers three questions:

  1. Is this comment not English or not human readable?
    • Column 'na'
  2. How aggressive or friendly is the tone of this comment?
    • Column 'how_aggressive_or_friendly_is_the_tone_of_this_comment'
    • Ranges from '---' (Very Aggressive) to '+++' (Very Friendly)
  3. Does the comment contain a personal attack or harassment? Please mark all that apply:
    • Column 'is_harassment_or_attack'
    • Users can specify that the attack is:
      • Targeted at the recipient of the message (e.g. you suck). ('recipient')
      • Targeted at a third party (e.g. Bob sucks). ('third_party')
      • Being reported or quoted (e.g. Bob said Henri sucks). ('quoting')
      • Another kind of attack or harassment. ('other')
      • This is not an attack or harassment. ('not_attack')
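
To illustrate how the multi-select answer in 'is_harassment_or_attack' can be turned into one indicator column per option, here is a minimal sketch. It assumes that the selected options are stored as a single newline-separated string per row, which is an assumption about the export format and is not confirmed by this notebook. The helper name count_label and the toy data are hypothetical; the notebook itself relies on create_column_of_counts from crowdflower_analysis, whose implementation may differ.

```python
import pandas as pd

# Option labels as listed above.
ATTACK_LABELS = ['not_attack', 'other', 'quoting', 'recipient', 'third_party']


def count_label(answers, label, sep='\n'):
    """Return 1 where `label` is among the selected options, else 0.

    Hypothetical helper; assumes options are joined by `sep` in one string.
    """
    def has_label(answer):
        # Missing answers (None, NaN, False) have no split method.
        if not hasattr(answer, 'split'):
            return 0
        return int(label in answer.split(sep))
    return answers.apply(has_label)


# Toy data, not taken from the real annotations.
toy = pd.DataFrame({
    'is_harassment_or_attack': [
        'not_attack',
        'recipient\nthird_party',
        'quoting',
        None,
    ]
})
for label in ATTACK_LABELS:
    toy[label] = count_label(toy['is_harassment_or_attack'], label)
```

Splitting on the separator and testing for exact membership avoids accidental substring matches, which a plain substring search could produce if one label were contained in another.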

Loading packages and data


In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
from __future__ import division
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from crowdflower_analysis import *
from krippendorf_alpha import *
from krippendorf_alpha_grrrr import *

In [2]:
pd.set_option('display.max_colwidth', 1000)

In [3]:
dat = pd.read_csv('../../../../data/annotations/nda/nda onion layer 5 raters 10.csv')

In [4]:
dat = dat[dat['_golden'] == False]
# Replace missing values with False
dat = dat.fillna(False)
attack_columns = ['not_attack', 'other', 'quoting', 'recipient', 'third_party']
for col in attack_columns:
    dat[col] = create_column_of_counts(dat['is_harassment_or_attack'], col)

In [5]:
chosen_ids = set(dat['rev_id'].unique()[0:1000])

In [6]:
sub_dat = dat[dat['rev_id'].isin(chosen_ids)]

In [7]:
groups = sub_dat.groupby('_worker_id')

In [8]:
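# Build one {rev_id: label} dict per worker; this list of dicts is the
# structure passed to krippendorff_alpha below.
data = []
for _, worker_df in groups:
    ratings = {}
    for _, row in worker_df[['rev_id', 'recipient']].iterrows():
        ratings[row['rev_id']] = row['recipient']
    data.append(ratings)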

In [9]:
krippendorff_alpha(data, metric=nominal_metric)


Out[9]:
0.45132419296394688
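
To make the statistic concrete, here is a minimal sketch of Krippendorff's alpha for nominal labels, alpha = 1 - D_o / D_e, where D_o is the observed disagreement within units and D_e is the disagreement expected from the pooled label frequencies. It takes the same list-of-dicts layout built above (one dict per rater, mapping unit to label). The function name nominal_alpha and the toy ratings are hypothetical; this is not the krippendorff_alpha implementation imported earlier, it covers only the nominal metric, and it is not claimed to reproduce the exact value shown above.

```python
from collections import Counter, defaultdict


def nominal_alpha(raters):
    """Krippendorff's alpha for nominal labels (illustrative sketch).

    raters: list of dicts, one per rater, mapping unit id -> label.
    """
    # Gather all labels given to each unit.
    labels_by_unit = defaultdict(list)
    for ratings in raters:
        for unit, label in ratings.items():
            labels_by_unit[unit].append(label)

    # Units with a single rating cannot be paired and are dropped.
    pairable = [labels for labels in labels_by_unit.values() if len(labels) > 1]
    n = sum(len(labels) for labels in pairable)
    if n == 0:
        raise ValueError("no unit was rated by at least two raters")

    # Observed disagreement: mismatching ordered pairs within each unit,
    # with each unit weighted by 1 / (m - 1).
    d_o = 0.0
    totals = Counter()
    for labels in pairable:
        m = len(labels)
        counts = Counter(labels)
        totals.update(counts)
        mismatches = m * m - sum(c * c for c in counts.values())
        d_o += mismatches / float(m - 1)
    d_o /= float(n)

    # Expected disagreement: mismatching ordered pairs over all pooled labels.
    d_e = (n * n - sum(c * c for c in totals.values())) / float(n * (n - 1))
    if d_e == 0:
        # Only one label was ever used; returning 1.0 is a convention
        # chosen for this sketch.
        return 1.0
    return 1.0 - d_o / d_e


# Toy ratings: two raters, four units, one disagreement (unit 2).
rater_a = {1: 0, 2: 0, 3: 1, 4: 1}
rater_b = {1: 0, 2: 1, 3: 1, 4: 1}
alpha = nominal_alpha([rater_a, rater_b])  # 8/15, about 0.533
```

Working from label counts rather than enumerating every pair keeps the computation linear in the number of ratings, which matters once each of many revisions has 10 ratings.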

In [10]:
cleaned_df = clean_df(sub_dat)

In [11]:
Krippendorf_alpha(cleaned_df, ['not_attack_0', 'not_attack_1'])


Out[11]:
0.47022523695316831

In [12]:
# Disabled: per-layer agreement, kept for reference.
# for key in grouped_dat.keys():
#     print "Krippendorff's Alpha (aggressiveness) for layer %s: " % key
#     print Krippendorf_alpha(grouped_dat[key], aggressive_columns, distance=interval_distance)
#     print "Krippendorff's Alpha (attack) for layer %s: " % key
#     print Krippendorf_alpha(grouped_dat[key], ['not_attack_0', 'not_attack_1'])